I’ve been working on grammar‑constrained decoding for a while, and I want to write down the core idea in a form that’s (a) communicable and (b) timestamped.
This is a preliminary post: enough detail to explain the "read the GLR stack with a compiled weighted automaton" approach, but not a full "here's how to build it in production without blowing up memory" tutorial. I'll likely follow up with a post focused on compilation tricks, determinization order, sharing/interner strategies, and the ugly cases.
You have a grammar (often a JSON Schema, or some programming‑language‑ish grammar) and you want an LLM to produce output that is grammatically valid.
At each decoding step, the model produces logits over its vocabulary (V). You sample a token, append it to the output, repeat.
Constrained decoding means: don’t sample tokens that would make the output grammatically invalid.
So we maintain a constraint state, and at each step we compute a mask (M \subseteq V) of allowed tokens and apply it to the logits.
No more broken JSON.
At runtime you can think of constrained decoding as two functions:
- `get_mask(state) -> M` returns a bitmask / set (M \subseteq V) over the LLM vocabulary.
- `commit(state, token) -> state'` updates the constraint state after you actually choose a token.

In a typical inference loop you want this pipeline:
```
loop:
    parallel:
        GPU: compute logits
        CPU: compute mask
    apply mask to logits
    sample token
    commit token to model and constraint
```
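To make the "apply mask to logits" step concrete, here's a minimal pure-Python sketch. The names (`apply_mask`, `pick_greedy`) are invented for illustration; a real system does this vectorized, inside the sampling kernel, over a whole batch.

```python
NEG_INF = float("-inf")

def apply_mask(logits, allowed):
    # Disallowed tokens get -inf so softmax/sampling can never pick them.
    return [l if ok else NEG_INF for l, ok in zip(logits, allowed)]

def pick_greedy(masked):
    # Greedy pick for illustration; a real decoder samples from softmax(masked).
    return max(range(len(masked)), key=lambda i: masked[i])

# toy: vocab of 5 tokens, constraint allows only tokens 1 and 3
logits = [2.0, 0.5, 3.0, 1.0, -1.0]
allowed = [False, True, False, True, False]
tok = pick_greedy(apply_mask(logits, allowed))
```

Note that token 2 has the highest raw logit but is masked out, so the pick falls to the best allowed token.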
This split is conceptually clean, and it’s also where most real systems live or die.
LLM inference is batched: multiple requests share the GPU. That means mask generation sits directly on the critical path of decoding.
If you want ~1k tokens/sec, you have ~1ms per token to get a mask back. And if you miss even a single step—if one mask takes 2ms instead of 1ms—you don’t just slow down one request: you stall every request sharing the batch, because batched decoding advances in lockstep.
So the bar isn’t “fast on average”. It’s:
can you do it fast every single step?
That’s the motivation for everything below.
## commit is incremental parsing (GLR + a graph‑structured stack)

The commit operation is incremental parsing: you have a growing prefix and you update the parse state as new bytes arrive.
For ambiguous grammars or ambiguous lexing, the standard tool is Generalized LR (GLR) parsing with a graph‑structured stack (GSS): instead of committing to one parse, you fork the stack on conflicts and merge stacks that converge, so the set of live parses forms a DAG rather than a single path.
This is well‑trodden territory. Tree‑sitter and Lezer both live in the “GLR-ish + GSS” world.
A lexer turns bytes into grammar terminals (things like NUMBER, {, IDENT).

At a decoding step you have a constraint state (often a GSS), and you need a mask over LLM tokens.
The immediate trap is to treat “try a token” as “append that token to the parser and see if it errors”.
But you don’t append LLM tokens to the parser.
You append terminal symbols.
And one LLM token is a byte string which may:

- cover several terminals at once ("}\n" → RBRACE NEWLINE, etc.),
- stop in the middle of a terminal (half an identifier, an unterminated string),
- or admit multiple lexings depending on context.

So the real question isn’t:
does token (v) work?
It’s:
does token (v)’s byte string admit some lexing into terminals that the parser can consume from some stack path in the current GSS?
You can imagine the brute‑force version: for each vocabulary token, clone the lexer + parser state, feed the token’s bytes through, and keep the token if nothing errors out.
This is easy to describe and painful to run.
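A brute‑force sketch of that idea, with a toy "grammar" (digit strings, checked by regex) standing in for the real lexer + GLR parser; all names here are illustrative:

```python
import re

# Hypothetical stand-in for "clone the lexer+parser and try the bytes":
# here a valid output prefix is just any run of ASCII digits.
PREFIX_OK = re.compile(rb"[0-9]*")

def is_viable_prefix(candidate: bytes) -> bool:
    m = PREFIX_OK.match(candidate)
    return m is not None and m.end() == len(candidate)

def naive_mask(output_so_far: bytes, vocab: list[bytes]) -> set[int]:
    # O(|V|) speculative checks per decoding step -- the part that's too slow.
    return {i for i, tok in enumerate(vocab)
            if is_viable_prefix(output_so_far + tok)}

vocab = [b"12", b"3a", b"4", b"x"]
mask = naive_mask(b"7", vocab)  # which tokens keep "7..." a digit string?
```

Even in this toy, every step pays for the whole vocabulary; with a real parser each check also carries stack cloning and reductions.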
A common improvement is to avoid iterating over the whole vocabulary by building a trie/trellis over token byte strings, so that tokens sharing a byte prefix also share the speculative lexer/parser work for that prefix.
This often works well on average, especially for simple grammars.
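A sketch of the trie idea, assuming a hypothetical `step(state, byte)` that advances a speculative lexer/parser state or returns `None` when the byte is inadmissible; the structure and names are illustrative:

```python
from collections import defaultdict

def build_trie(vocab):
    trie = defaultdict(dict)   # trie[node][byte] -> child node id
    ends = defaultdict(list)   # ends[node] -> token ids whose bytes end here
    next_id = 1                # node 0 is the root
    for tid, tok in enumerate(vocab):
        node = 0
        for b in tok:
            if b not in trie[node]:
                trie[node][b] = next_id
                next_id += 1
            node = trie[node][b]
        ends[node].append(tid)
    return trie, ends

def trie_mask(step, start_state, trie, ends):
    # DFS over the trie: each distinct byte prefix is speculated on once,
    # instead of once per vocabulary token.
    allowed, stack = set(), [(0, start_state)]
    while stack:
        node, st = stack.pop()
        allowed.update(ends[node])
        for b, child in trie[node].items():
            nxt = step(st, b)
            if nxt is not None:
                stack.append((child, nxt))
    return allowed

vocab = [b"12", b"3a", b"4"]
trie, ends = build_trie(vocab)
step = lambda st, b: st if bytes([b]).isdigit() else None  # toy: digits only
mask = trie_mask(step, "start", trie, ends)
```

The win is prefix sharing; the loss, as the next section argues, is that `step` still hides arbitrary parser work.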
But if you care about predictable sub‑millisecond p99.9, two issues keep recurring.
Even with a deterministic lexer, a single terminal shift can trigger a chain of reductions. In LR: a shift can enable a reduce, the reduce exposes an earlier stack state, that state enables another reduce, and so on until the chain bottoms out.
Those reduction chains are fast in plain LR, but in GLR they can fan out. And GLR stack manipulation is pointer‑heavy and cache‑unfriendly: GSS nodes, merges, pruning, conflict handling.
If you do this inside mask generation—while walking a token trie—you’re effectively running a speculative GLR parser for lots of hypothetical token continuations, every step.
That’s exactly the kind of work you don’t want on the latency‑critical path.
Even with tries, there exist single BPE tokens whose byte strings correspond to a long terminal sequence.
Example: Python accepts absurd unary chains:
>>> ----------------1
1
Those sixteen hyphens can easily be one LLM token (depending on tokenizer), but lex as 16 MINUS terminals.
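You can check both halves of the claim quickly (the "lexing" here is written out by hand, just to make the terminal count explicit):

```python
# The sixteen-hyphen unary chain from the example above.
token = "-" * 16

# Python really does evaluate the chain: an even number of unary minuses.
value = eval(token + "1")

# And a lexer hands the parser one MINUS terminal per hyphen.
terminals = ["MINUS" for ch in token]
```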
You can normalize the grammar, memoize, simplify GSS nodes, prune aggressively… and some of these tokens still show up as worst‑case outliers.
My experience: if you want predictable performance on arbitrary inputs/grammars, “trie + speculative incremental parse inside masking” is a dead end.
Token validity depends heavily on what’s on the LR stack.
So instead of asking:
for each token (v), does it work on this stack?
ask:
given this stack, which tokens (v) work?
This reframing changes what you need at runtime. You need a data structure that maps:
stack configuration → allowed vocabulary tokens
…and you need to execute it with predictable work per step.
That’s where the weighted automaton comes in.
Think of a finite automaton, except transitions carry weights and path traversal combines weights.
In the version that matters for masking: transitions are labeled by LR stack state IDs, weights are sets of vocabulary tokens, weights intersect along a path, and they union where paths merge.
At runtime you “feed” the automaton a representation of your current parse configuration: the sequence of LR state IDs read from the top of the stack downward.
A vocabulary token is valid if it’s valid on any parse path in the GSS, so ambiguity corresponds naturally to union at merges.
If you squint at LR theory, this is the same “viable prefixes are regular” vibe: the set of stack configurations where a terminal is shiftable/reducible is regular (recognizable by a finite machine) when you treat stacks as words over state IDs.
That’s the conceptual bridge: “stack languages are regular enough” that an automaton can read them.
Operationally, masking becomes “flow token sets through a tiny transition system while reading a bounded suffix of the stack”:
```python
def get_mask(stack_state_ids_top_to_bottom):
    frontier = {A.start: ALL_TOKENS}          # map automaton_state -> token_set
    for sid in stack_state_ids_top_to_bottom:
        new = {}
        for a_state, tokens in frontier.items():
            for (a2, weight) in A.step(a_state, sid):
                tokens2 = tokens & weight     # ∩ filter along the path
                if tokens2:
                    new[a2] = new.get(a2, EMPTY) | tokens2  # ∪ merge
        frontier = new
        # stop once further stack symbols can't change the result
        if frontier_is_decided(frontier):
            break
    return tokens_accepted_by(frontier)
```
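Here’s a runnable toy instance of that walk, with token sets as int bitmasks. The three-state automaton and its transitions are invented purely to show intersection along paths and the early stop; nothing about it comes from a real grammar.

```python
# Toy: 3 vocabulary tokens live in a bitmask (bits 0..2). Transitions are
# keyed by (automaton_state, stack_state_id); weights are token-set bitmasks.
ALL_TOKENS = 0b111

TRANSITIONS = {
    ("start", 10): [("s1", 0b011)],   # top-of-stack state 10 allows tokens {0,1}
    ("start", 20): [("s2", 0b100)],   # top-of-stack state 20 allows token {2}
    ("s1",    30): [("acc", 0b001)],  # deeper context narrows {0,1} to {0}
}
ACCEPTING = {"s1", "s2", "acc"}

def get_mask(stack_ids_top_to_bottom):
    frontier = {"start": ALL_TOKENS}
    for sid in stack_ids_top_to_bottom:
        new = {}
        for a_state, tokens in frontier.items():
            for a2, weight in TRANSITIONS.get((a_state, sid), []):
                t2 = tokens & weight               # intersect along the path
                if t2:
                    new[a2] = new.get(a2, 0) | t2  # union where paths merge
        if not new:
            break  # nothing reads deeper: the result is settled
        frontier = new
    mask = 0
    for a_state, tokens in frontier.items():
        if a_state in ACCEPTING:
            mask |= tokens
    return mask
```

With stack `[10]` the mask is `{0,1}`; adding deeper context `[10, 30]` narrows it to `{0}`; irrelevant deeper states stop the walk early.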
The important point is what’s not happening: no per‑token lexing of byte strings, no speculative reductions, no GSS mutation. It’s table lookups plus set intersections and unions.
That’s the kind of work you can make very stable in worst‑case latency.
For GLR you don’t have one stack; you have a DAG of them.
Conceptually: you run the same frontier walk over the DAG, following every stack path from the top downward.
When GSS paths merge, you union token sets (a token is valid if it works on any stack path). When automaton paths merge, you also union.
A practical note: if you want this to be fast, your GSS representation can’t just be “edges with labels”. Nodes generally need to carry and merge some accumulator information so that “union at merges” is cheap and you don’t re-walk equivalent subgraphs repeatedly. This is doable, but it’s one of the “engineering details that deserve their own post.”
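One way to sketch that accumulator idea (the node layout and the per-state `STATE_TOKENS` filter are invented for the demo; real nodes carry more):

```python
from functools import lru_cache

# Toy per-LR-state token filter (bitmasks over 2 tokens), invented for the demo.
STATE_TOKENS = {10: 0b01, 20: 0b10, 30: 0b11}

# A GSS node here is just (lr_state_id, tuple_of_parent_nodes): hashable,
# so memoization means shared subgraphs are walked exactly once.
@lru_cache(maxsize=None)
def mask_from(node):
    lr_state, parents = node
    allowed = STATE_TOKENS.get(lr_state, 0)
    if parents:
        below = 0
        for p in parents:
            below |= mask_from(p)  # union where stack paths merge
        allowed &= below           # intersect along a single path
    return allowed

root = (30, ())
a = (10, (root,))
b = (20, (root,))
top = (30, (a, b))  # two stack paths merging at the top node
mask = mask_from(top)
```

The merge node recovers the union of what each path allows, and `root` is evaluated once even though both paths reach it.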
So far I’ve treated “the weighted automaton” as an opaque object you somehow possess.
A better mental model is:
you compile away the expensive speculation up front, so
`get_mask` becomes a bounded, read‑only graph walk at runtime.
There are two pieces you have to reconcile:

- the LLM side: vocabulary tokens, which are arbitrary byte strings, and
- the parser side: grammar terminals, whose validity depends on the stack.

The compiled object that powers masking is essentially a composition of:

- a Terminal DWA (vocabulary token → possible terminal sequences), and
- per‑terminal stack‑interaction templates (terminal → when/how the parser can process it),
into one final Parser DWA that reads stack state IDs and outputs allowed‑token sets.
I use “DWA” here as shorthand for a deterministic weighted automaton: deterministic control flow, with weights living in a semiring‑ish world where we intersect along paths and union at merges.
An LLM token is a byte string. A lexer consumes a byte stream and emits grammar terminals—sometimes none yet (mid‑token), sometimes one, sometimes several.
So what we need is not “what terminal is next?”, but:
given a current lexer configuration, what terminal sequences can a single vocabulary token realize?
The Terminal DWA represents this mapping.

A convenient construction is: take the trie of vocabulary token byte strings, run it against the lexer automaton (a product construction), and record the terminal sequences emitted along each path. Then determinize / compress.
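A brute‑force stand‑in for that construction (the real version shares work through the trie and then determinizes; here a two‑terminal maximal‑munch lexer just enumerates the mapping per token, and the terminal definitions are invented):

```python
import re

# Toy terminal definitions; maximal munch, NUMBER matched greedily.
TERMINALS = [("NUMBER", re.compile(r"[0-9]+")), ("MINUS", re.compile(r"-"))]

def terminal_sequence(token: str):
    """Return the terminal sequence a token's bytes realize, or None if the
    token can't be lexed at all. (Mid-token leftovers are omitted here; the
    real Terminal DWA also tracks the resulting lexer configuration.)"""
    out, i = [], 0
    while i < len(token):
        for name, rx in TERMINALS:
            m = rx.match(token, i)
            if m:
                out.append(name)
                i = m.end()
                break
        else:
            return None
    return out

mapping = {tok: terminal_sequence(tok) for tok in ["12", "--", "1-2", "x"]}
```

Note how one token can realize several terminals (`"1-2"`), which is exactly why the parser-side templates below have to be composable over sequences.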
The Terminal DWA is acyclic because it’s fundamentally derived from finite token byte strings: every step corresponds to consuming more of a finite vocabulary entry. There’s no way to loop and keep making progress inside a single token.
This matters later: acyclicity turns into a hard bound on how much work you can do per mask step.
Now fix a grammar terminal (t). In LR/GLR parsing, “process terminal (t)” is not just “shift (t)”.
It often means: perform a chain of reductions first (pop some states, push a goto state, repeat), and only then shift (t).
Doing that reduction simulation at runtime for many hypothetical tokens is what we’re trying to avoid.
So for each terminal (t), we precompute a compact automaton—a template—that describes when (t) can be processed as a function of the top of the LR stack (and what kind of stack interaction is required).
This is tightly related to classical GLR machinery, in particular the “faster GLR” line of work.
Among other things, this line of work analyzes how reduction behavior can be structured so that the relevant “look down the stack to decide reductions” behavior is bounded after appropriate grammar transformations.
To make these templates composable, you want an algebra for “what this terminal does to the stack” that:

- composes associatively, so whole terminal sequences can be precomposed offline,
- cancels matched push/pop pairs, and
- collapses mismatched pairs to a dead element (failure).
That’s exactly what the polycyclic monoid models.
At a high level: the polycyclic monoid has one generator per stack symbol (a push) plus a formal inverse (a pop); pushing a symbol and then popping that same symbol cancels to the identity, while popping a different symbol than was pushed collapses to zero.

Why this is useful here: each per‑terminal template’s stack interaction is an element of this monoid, so composing templates is just monoid multiplication, and associativity is what lets you precompute compositions for whole terminal sequences.
This isn’t a cute optimization; it’s the core mechanism that lets you precompute “what happens when I consume a whole terminal sequence” without simulating reductions step-by-step at runtime.
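A sketch of that algebra in code, normalizing every stack interaction to “pop these, then push these”; the `(pops, pushes)` representation and the example elements are invented for illustration:

```python
ZERO = None  # the dead element: an impossible stack interaction

def compose(a, b):
    """Compose stack actions a then b. An action is (pops, pushes):
    pop `pops` (top first), then push `pushes` (last element = new top).
    Matched push/pop pairs cancel; a mismatch collapses to ZERO."""
    if a is ZERO or b is ZERO:
        return ZERO
    (pop1, push1), (pop2, push2) = a, b
    push1, pop2 = list(push1), list(pop2)
    while push1 and pop2:
        if push1[-1] != pop2[0]:
            return ZERO           # mismatched push/pop: this path is dead
        push1.pop()
        pop2.pop(0)               # matched pair cancels
    return (tuple(pop1) + tuple(pop2), tuple(push1) + tuple(push2))

IDENT = ((), ())
shift_A  = ((), ("A",))           # push state A
reduce_A = (("A",), ("B",))       # pop A, push B (a toy reduction template)
net = compose(shift_A, reduce_A)  # the A push/pop cancels: net effect push B
bad = compose(shift_A, (("X",), ()))  # pops X but A is on top -> ZERO
```

Because `compose` is associative, a whole terminal sequence’s templates can be folded into one `(pops, pushes)` element at compile time.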
Depending on grammar shape, reduction behavior can create cycles in these templates (think recursion that keeps letting you reduce in ways that don’t structurally make progress).
With the right grammar normalizations (the same family of ideas that show up in “faster GLR” work), the per-terminal templates you actually need for masking can be made acyclic.
This is the point where “you only need a bounded suffix of the stack” becomes more than a heuristic: acyclicity implies a hard bound on how many stack states the template ever needs to inspect.
Now you compose:

- the Terminal DWA (vocabulary token → possible terminal sequences), with
- the per‑terminal templates (terminal → required stack interaction),

to get a single Parser DWA that:

- reads LR stack state IDs from the top of the stack downward, and
- outputs the set of vocabulary tokens that can be consumed next.
Two key points fall out of this composition:
When you concatenate templates for terminals (t_1 t_2 \dots t_k), you’re composing their stack interactions.
Adjacent push/pop interactions cancel. Mismatches fail. This cancellation is what prevents “long token produces many terminals” from turning into “runtime does proportional parser work.” You paid that cost at compile time and compiled it into bounded lookups.
Because:

- the Terminal DWA is acyclic, and
- the per‑terminal templates are acyclic (after the normalizations above),

their composition is acyclic as well.
Operational consequence: there is a fixed maximum depth (D) such that get_mask never needs to inspect more than (D) stack symbols to determine token validity (for the compiled grammar / automaton).
This is exactly the kind of predictability you want when you care about p99.9.
Does get_mask need to do any of that real parsing work at runtime? No.
At runtime:
- commit is responsible for actually advancing the real lexer + GLR parser state.
- get_mask only needs to answer “does there exist a viable way to consume this vocabulary token next?”

During compilation, you often need richer “stack interaction” information (push/pop structure) to correctly compose terminal sequences. But once you’ve compiled down to a machine that filters tokens based on the current stack, you can project away any information that’s only needed to construct the next stack. Masking is a read-only query.
There’s an additional mess you have to handle in any streaming setup: longest‑match (maximal munch) lexing means you often can’t decide which terminal you’re in until you’ve seen more bytes.
Classic example: "+" vs "++". After you see the first "+", you have a valid match for PLUS, but it’s also extendable to INCREMENT.
Incremental parsing frameworks deal with this in a standard way: keep both hypotheses alive until you see enough input to disambiguate, then prune the losing one.
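A tiny sketch of that branching for the "+" / "++" case; the hypothesis representation (emitted terminals plus a pending, still-extendable match) is invented for illustration:

```python
def hypotheses(buffer: str):
    """All ways to lex `buffer` over PLUS ("+") and INCREMENT ("++"), where
    the last match may still be provisional: a pending "+" could yet grow
    into INCREMENT. Returns a set of (emitted_terminals, pending_text)."""
    if buffer == "":
        return {((), "")}
    out = set()
    for terms, pending in hypotheses(buffer[:-1]):
        ch = buffer[-1]
        if ch != "+":
            continue  # this toy lexer only knows "+"
        if pending == "":
            out.add((terms + ("PLUS",), ""))  # finalize as PLUS now...
            out.add((terms, "+"))             # ...or hold it, hoping for "++"
        elif pending == "+":
            out.add((terms + ("INCREMENT",), ""))  # "++" completes
    return out
```

After one "+" there are two live hypotheses; after "++", maximal munch would prune the `("PLUS", "PLUS")` reading in favor of `INCREMENT`.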
In my implementation I sometimes refer to this as “inhibited” (or “provisional”) terminals: a terminal that has already matched but could still extend is held back while both continuations stay alive.
This isn’t meant as a claim of novelty; it’s just a convenient operational label for the usual “speculate, branch, prune” pattern, and it meshes naturally with GLR+GSS.
Practically, your constraint state is not just “parser stacks”; it’s:
(a GSS of parser stacks) + (a set of active lexer configurations)
…and masking needs to be conditioned on both. In the compiled view, that usually means you have multiple initial configurations for the Terminal/Parser DWA (one per active lexer hypothesis), and you union the resulting token sets.
You end up with a deliberate split:
- commit(state, token) (real parsing). This can still degrade on adversarial ambiguity (GLR is not a magic shield), but you’re at least sitting on decades of incremental parsing practice.
- get_mask(state) (compiled query). The runtime work is predictable: table lookups and set ops on a bounded walk—no speculative reductions, no per-token lexing, no “ugly token” surprise loops.
Mask computation becomes: read a bounded suffix of the stack configuration, flow token sets through the compiled automaton, and union across stack paths and lexer hypotheses.

Crucially, masking is no longer: “run a speculative lexer + parser for every candidate continuation.”
The cost shifts to compile time.
Building the Parser DWA involves determinization and heavy simplification in a “weights are sets” world (union at merges, intersection along paths). If you’re not careful, large grammars can blow up in memory/time.
Most of my engineering effort has gone into making compilation behave: determinization order, sharing/interning of intermediate automata, and keeping memory bounded on large grammars.
That’s the part I’ll likely write up next, because it’s where “this looks elegant on paper” becomes “this actually runs in production.”
If you care about predictable p99.9 latency, the main lesson is simple: keep real parsing work out of get_mask—compile the speculation away ahead of time, and make masking a bounded, cache-friendly read of the current GLR stack.